Main Analysis
Data Cleaning
Because the data set is very messy, we put a lot of effort into cleaning it and extracting useful information for analysis.
- Convert columns to the correct types
- Consolidate names, regions, and dates
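The two cleaning steps above can be illustrated with a minimal sketch. The raw table and its columns (`raw`, `Country`, `Month`, `Visitors`) are made up for illustration and are not the actual NTTO columns, so the real cleaning code may look different.

```r
library(dplyr)

# Hypothetical raw extract: counts stored as text with thousands separators,
# inconsistent capitalisation/whitespace in names, and dates as "YYYY-MM" strings.
raw <- data.frame(
  Country  = c(" Canada", "canada ", "MEXICO"),
  Month    = c("2015-07", "2015-08", "2015-07"),
  Visitors = c("1,200", "950", "2,300"),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  mutate(
    # Consolidate names: trim whitespace and use lower case throughout
    Region = tolower(trimws(Country)),
    # Convert to the correct types: text counts to numeric, month strings to Date
    Visitors = as.numeric(gsub(",", "", Visitors)),
    Date = as.Date(paste0(Month, "-01")),
    Year = as.integer(format(Date, "%Y"))
  ) %>%
  select(Region, Year, Date, Visitors)
```

After this step every table shares the same `Region`, `Year`, and `Date` keys, which is what makes the later grouping and joining possible.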
Join inbound and outbound data by region
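# Packages used throughout this section
library(dplyr)
library(tidyr)
library(plotly)

# Regions kept for the analysis. In a regular expression the parentheses form a
# group, so "latin america (excl mexico)" matches the text "latin america excl mexico".
region_str <- "africa|asia|canada|latin america (excl mexico)|europe|mexico|middle east|oceania"

# Monthly inbound totals per region
inbound_region <- tidy_ntto_inbound_m %>%
  filter(grepl(region_str, MixRegion)) %>%
  select(Region = MixRegion, Year, Date, Inbound) %>%
  group_by(Region, Year, Date) %>%
  summarise(TotalInbound = sum(Inbound)) %>%
  ungroup()

# Monthly outbound totals per region
outbound_region <- tidy_ntto_outbound_m %>%
  select(Region, Year, Date, Outbound) %>%
  group_by(Region, Year, Date) %>%
  summarise(TotalOutbound = sum(Outbound)) %>%
  ungroup()

# Keep only the region/month combinations present in both tables
regional_travel <- inner_join(inbound_region, outbound_region,
                              by = c("Region", "Year", "Date"))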
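Because `inner_join` silently drops any region and month that appears in only one of the two tables, it can be worth checking what is lost. The sketch below uses small made-up tables (`inbound_toy` and `outbound_toy`, not the project data) and `anti_join` to list the unmatched rows.

```r
library(dplyr)

# Toy tables standing in for inbound_region and outbound_region (values are made up)
inbound_toy <- data.frame(
  Region = c("asia", "asia", "europe"),
  Year = c(2015, 2015, 2015),
  Date = c("2015-01-01", "2015-02-01", "2015-01-01"),
  TotalInbound = c(10, 12, 20),
  stringsAsFactors = FALSE
)
outbound_toy <- data.frame(
  Region = c("asia", "europe", "europe"),
  Year = c(2015, 2015, 2015),
  Date = c("2015-01-01", "2015-01-01", "2015-02-01"),
  TotalOutbound = c(5, 18, 19),
  stringsAsFactors = FALSE
)

keys <- c("Region", "Year", "Date")

# Rows present in both tables
joined <- inner_join(inbound_toy, outbound_toy, by = keys)

# Rows that the inner join drops from each side
inbound_only  <- anti_join(inbound_toy, outbound_toy, by = keys)
outbound_only <- anti_join(outbound_toy, inbound_toy, by = keys)
```

If both `inbound_only` and `outbound_only` are empty on the real tables, the inner join loses nothing.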
Challenges
There are several challenges in our project:
- Because of problems in the data set, such as inconsistency, we had to spend a lot of time cleaning and reorganizing it, which made the work tedious and laborious.
- Some graphs need country-level data, but the data set only provides region-level data. In those cases we have to apply the data to the whole region, which makes the analysis less comprehensive and detailed.
- Shiny is a great tool for creating interactive data visualizations in R; however, we had little experience with it and had to spend time learning it, which was not easy in such a short time.
Analysis
Travel and Tourism Analysis
regional_travel %>%
  select(Year, TotalInbound, TotalOutbound) %>%
  group_by(Year) %>%
  summarise(TotalInbound = sum(TotalInbound), TotalOutbound = sum(TotalOutbound)) %>%
  plot_ly(x = ~Year, y = ~TotalInbound, type = 'bar', name = 'Inbound',
          marker = list(color = 'rgb(55, 83, 109)')) %>%
  add_trace(y = ~TotalOutbound, name = 'Outbound', marker = list(color = 'rgb(26, 118, 255)')) %>%
  layout(title = 'Yearly Inbound and Outbound',
         xaxis = list(title = "",
                      tickfont = list(size = 14, color = 'rgb(107, 107, 107)')),
         yaxis = list(title = 'Number of People',
                      titlefont = list(size = 16, color = 'rgb(107, 107, 107)'),
                      tickfont = list(size = 14, color = 'rgb(107, 107, 107)')),
         legend = list(orientation = 'h', x = 0, y = 1,
                       bgcolor = 'rgba(255, 255, 255, 0)', bordercolor = 'rgba(255, 255, 255, 0)'),
         barmode = 'group', bargap = 0.15)
The United States is one of the largest destinations for international visitors and also has a large number of outbound journeys. Both inbound and outbound numbers increased substantially from 2009 to 2010. Between 2010 and 2013, the number of international visitors rose slightly each year, while the number of outbound travellers stayed roughly stable. Since 2013 both have grown considerably, reaching 77.5 million international visits and 85.6 million outbound travellers in 2015.
Naturally, we start to wonder: what are the most popular destinations for Americans, and where do these international visitors come from? To answer these questions, we break the graph down by region.
# Inbound by region. type = "scatter" is set explicitly so plotly does not have to infer the trace type.
p1 <- inbound_region %>%
  spread(Region, TotalInbound) %>%
  filter(Date > '2008-11') %>%
  plot_ly(x = ~as.POSIXct(Date)) %>%
  add_trace(y = ~africa, type = "scatter", name = 'africa', mode = 'lines', line = list(color = "gray", width = 1)) %>%
  add_trace(y = ~asia, type = "scatter", name = 'asia', mode = 'lines', line = list(color = "red", width = 1)) %>%
  add_trace(y = ~canada, type = "scatter", name = 'canada', mode = 'lines', line = list(color = "orange", width = 1)) %>%
  add_trace(y = ~europe, type = "scatter", name = 'europe', mode = 'lines', line = list(color = "pink", width = 1)) %>%
  add_trace(y = ~`latin america excl mexico`, type = "scatter", name = 'latin america excl mexico', mode = 'lines', line = list(color = "green", width = 1)) %>%
  add_trace(y = ~mexico, type = "scatter", name = 'mexico', mode = 'lines', line = list(color = "purple", width = 1)) %>%
  add_trace(y = ~`middle east`, type = "scatter", name = 'middle east', mode = 'lines', line = list(color = "black", width = 1)) %>%
  add_trace(y = ~oceania, type = "scatter", name = 'oceania', mode = 'lines', line = list(color = "blue", width = 1))
# Outbound by region, same colors; legend entries are hidden to avoid duplicates
p2 <- outbound_region %>%
  spread(Region, TotalOutbound) %>%
  plot_ly(x = ~as.POSIXct(Date)) %>%
  add_trace(y = ~africa, type = "scatter", name = 'africa', mode = 'lines', line = list(color = "gray", width = 1), showlegend = F) %>%
  add_trace(y = ~asia, type = "scatter", name = 'asia', mode = 'lines', line = list(color = "red", width = 1), showlegend = F) %>%
  add_trace(y = ~canada, type = "scatter", name = 'canada', mode = 'lines', line = list(color = "orange", width = 1), showlegend = F) %>%
  add_trace(y = ~europe, type = "scatter", name = 'europe', mode = 'lines', line = list(color = "pink", width = 1), showlegend = F) %>%
  add_trace(y = ~`latin america excl mexico`, type = "scatter", name = 'latin america excl mexico', mode = 'lines', line = list(color = "green", width = 1), showlegend = F) %>%
  add_trace(y = ~mexico, type = "scatter", name = 'mexico', mode = 'lines', line = list(color = "purple", width = 1), showlegend = F) %>%
  add_trace(y = ~`middle east`, type = "scatter", name = 'middle east', mode = 'lines', line = list(color = "black", width = 1), showlegend = F) %>%
  add_trace(y = ~oceania, type = "scatter", name = 'oceania', mode = 'lines', line = list(color = "blue", width = 1), showlegend = F)
# Stack the two panels on a shared time axis
subplot(p1, p2, nrows = 2, shareX = T) %>%
  layout(title = "Inbound vs. Outbound",
         yaxis = list(title = "Inbound"),
         yaxis2 = list(title = "Outbound"),
         legend = list(orientation = 'h'))
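As a possible alternative to spreading the data and adding one trace per region, the long-format table can be passed to `plot_ly` directly with `color = ~Region`. This is only a sketch on made-up data (`inbound_toy` is hypothetical), not the code used for the figure above.

```r
library(plotly)

# Toy long-format data standing in for inbound_region (values are made up)
inbound_toy <- data.frame(
  Region = rep(c("asia", "canada", "europe"), each = 3),
  Date = rep(as.Date(c("2015-06-01", "2015-07-01", "2015-08-01")), times = 3),
  TotalInbound = c(8, 11, 9, 20, 30, 25, 12, 15, 14),
  stringsAsFactors = FALSE
)

# One named colour per region, reusing colours from the traces above
region_colors <- c(asia = "red", canada = "orange", europe = "pink")

# color = ~Region splits the data into one line per region
p <- plot_ly(inbound_toy, x = ~Date, y = ~TotalInbound,
             color = ~Region, colors = region_colors,
             type = "scatter", mode = "lines",
             line = list(width = 1))
```

With this form, adding or removing a region only changes the data and the colour vector, not the number of `add_trace` calls.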
After separating out each region, we observed several things:
- Most international travellers come from Canada, Mexico, Europe, and Asia.
- Mexico, Canada, Europe, and Latin America (excluding Mexico) are the top destinations for Americans.
- Every line shows seasonality, and the peak is usually reached in summer. For example, the number of Canadian visitors reaches its highest point of the year every July.
- There is a boom in visitors from Latin America (excluding Mexico) at the beginning of 2014, and a boom in people travelling to Mexico at the start of 2010.
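The seasonality observation can be checked numerically by finding, for each region and year, the month with the highest count. The sketch below runs on a made-up monthly series (`monthly_toy`); the values are illustrative only and are not taken from the NTTO data.

```r
library(dplyr)

# Toy monthly series for one region over two years (values are made up)
monthly_toy <- data.frame(
  Region = "canada",
  Date = seq(as.Date("2014-01-01"), by = "month", length.out = 24),
  TotalInbound = c(10, 11, 13, 14, 16, 19, 25, 22, 15, 13, 11, 12,
                   11, 12, 14, 15, 17, 20, 27, 23, 16, 14, 12, 13),
  stringsAsFactors = FALSE
)

# For each region and year, keep the month with the highest inbound count
peak_months <- monthly_toy %>%
  mutate(Year = as.integer(format(Date, "%Y")),
         Month = as.integer(format(Date, "%m"))) %>%
  group_by(Region, Year) %>%
  slice(which.max(TotalInbound)) %>%
  ungroup() %>%
  select(Region, Year, Month, TotalInbound)
```

If the peak month is the same in most years, that supports reading the pattern as seasonal rather than as a one-off spike.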
We further plot inbound and outbound travel for each region to explore the patterns individually.
p1 <- regional_travel %>%
filter(Region=='africa') %>%
plot_ly(x = ~as.POSIXct(Date), height = 1000) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1)) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1)) %>%
layout(autosize=F)
p2 <- regional_travel %>%
filter(Region=='asia') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p3 <- regional_travel %>%
filter(Region=='canada') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p4 <- regional_travel %>%
filter(Region=='europe') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p5 <- regional_travel %>%
filter(Region=='latin america excl mexico') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p6 <- regional_travel %>%
filter(Region=='mexico') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p7 <- regional_travel %>%
filter(Region=='middle east') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p8 <- regional_travel %>%
filter(Region=='oceania') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
subplot(p1, p2, p3, p4, p5, p6, p7, p8, nrows=8) %>%
layout(title = "Regional Inbound and Outbound",
yaxis = list(title = "Africa"),
yaxis2 = list(title = "Asia"),
yaxis3 = list(title = "Canada"),
yaxis4 = list(title = "Europe"),
yaxis5 = list(title = "Latin America"),
yaxis6 = list(title = "Mexico"),
yaxis7 = list(title = "Middle East"),
yaxis8 = list(title = "Oceania"),
legend = list(orientation = 'h', x = 0, y = 1.005)
)
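The eight nearly identical blocks above could also be generated with a small helper function and `lapply`. The following is a sketch under assumptions: `travel_toy` is a made-up stand-in for `regional_travel`, and `plot_region` is a hypothetical helper that is not part of the original code.

```r
library(dplyr)
library(plotly)

# Toy stand-in for regional_travel (values are made up)
travel_toy <- data.frame(
  Region = rep(c("africa", "asia"), each = 3),
  Date = rep(as.Date(c("2015-01-01", "2015-02-01", "2015-03-01")), times = 2),
  TotalInbound = c(3, 4, 5, 30, 32, 35),
  TotalOutbound = c(2, 2, 3, 20, 19, 21),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one inbound/outbound line chart for a single region
plot_region <- function(region, show_legend = FALSE) {
  travel_toy %>%
    filter(Region == region) %>%
    plot_ly(x = ~Date) %>%
    add_trace(y = ~TotalInbound, type = "scatter", mode = "lines", name = "Inbound",
              line = list(color = "gray", width = 1), showlegend = show_legend) %>%
    add_trace(y = ~TotalOutbound, type = "scatter", mode = "lines", name = "Outbound",
              line = list(color = "blue", width = 1), showlegend = show_legend)
}

regions <- unique(travel_toy$Region)

# Show the legend only for the first panel, as in the code above
plots <- lapply(seq_along(regions),
                function(i) plot_region(regions[i], show_legend = (i == 1)))

# subplot() accepts a list of plotly objects
combined <- subplot(plots, nrows = length(plots), shareX = TRUE)
```

The per-panel axis titles can still be set afterwards with `layout(yaxis = ..., yaxis2 = ...)`, as in the original code.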
Africa: More and more people from Africa have travelled to the U.S. since 2013, but the number of Americans who go to Africa has been very stable over the past 7 years.
Asia: More and more visitors from Asia come to the U.S.; however, fewer Americans have gone to Asia since May 2010. It is also notable that the number of visitors from Asia grows faster in July than in other months. This is probably due to the increase in foreign students.
Canada: Not many people went to Canada during 2009. This may have been influenced by the global financial crisis in 2009.
Europe: Unlike other regions, the inbound and outbound trends are very close to each other, which means that the number of people leaving the U.S. for Europe moves in step with the number arriving from Europe.
Latin America (excluding Mexico): A huge increase occurred in 2014. Because the jump is so large, we question the correctness of the data source.
Mexico: A boom in outbound travel happens at the end of 2009. We did a lot of research online and found many news articles from that time on the topic of more Mexicans leaving than coming to the U.S. We suspect this could be the reason. Unlike other regions, Mexico has no obvious seasonal peak.
Middle East: Both inbound and outbound numbers increase year over year.
Oceania: As with Asia, fewer U.S. citizens have visited the Oceania region since mid-2010.
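One way to make a suspicion such as the Latin America jump more concrete is to compute year-over-year changes and flag unusually large ones. The sketch below is an illustration on made-up numbers; `yearly_toy` and the 50% `threshold` are assumptions, not values from the project.

```r
library(dplyr)

# Toy yearly totals for two regions (values are made up)
yearly_toy <- data.frame(
  Region = rep(c("europe", "latin america excl mexico"), each = 4),
  Year = rep(2012:2015, times = 2),
  TotalInbound = c(100, 104, 108, 112, 50, 52, 90, 95),
  stringsAsFactors = FALSE
)

threshold <- 0.5  # hypothetical cut-off: flag year-over-year changes above 50%

jumps <- yearly_toy %>%
  group_by(Region) %>%
  arrange(Year, .by_group = TRUE) %>%
  # Relative change against the previous year within the same region
  mutate(YoYChange = TotalInbound / lag(TotalInbound) - 1,
         Suspicious = !is.na(YoYChange) & abs(YoYChange) > threshold) %>%
  ungroup()

flagged <- filter(jumps, Suspicious)
```

A flagged row does not prove the data is wrong; it only marks a point worth checking against the original source.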
Spend Analysis
yearly_spend <- tidy_ntto_spend_y %>%
filter(Region!='european union', Region!='south-central america', Region!='overseas') %>%
mutate(Region=recode(Region, "asia-pacific"="asia"), Spend=Spend*1000000) %>%
select(-Missing) %>%
arrange(Region, Year, Type, Category)
yearly_spend %>%
group_by(Type, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Type, TotalSpend) %>%
ungroup %>%
plot_ly(x = ~Year) %>%
add_trace(y=~`Payments (imports)`, type="scatter", name='Payments (imports)', mode = 'lines+markers', line = list(color="blue", width = 2)) %>%
add_trace(y=~`Receipts (exports)`, type="scatter", name='Receipts (exports)', mode = 'lines+markers', line = list(width = 2)) %>%
layout(title = "Yearly Spending",
xaxis = list(title = "Year"), yaxis = list(title = "Spend"),
legend = list(orientation = 'h', x = 0.5, y = 1.005))
The graph shows that international visitors spend more money in the U.S. than Americans spend travelling abroad. Spending increases year over year overall; however, there is a big drop in 2009, when the global financial crisis happened.
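The gap between the two lines can be expressed as a yearly balance, receipts minus payments. This is a minimal sketch on made-up totals (`spend_toy`); the `Balance` column is our own illustrative name and does not appear in the data set.

```r
library(dplyr)
library(tidyr)

# Toy yearly totals by type (values are made up)
spend_toy <- data.frame(
  Year = rep(2008:2010, each = 2),
  Type = rep(c("Payments (imports)", "Receipts (exports)"), times = 3),
  TotalSpend = c(80, 110, 75, 95, 78, 105),
  stringsAsFactors = FALSE
)

# One column per type, then receipts minus payments for each year
balance <- spend_toy %>%
  spread(Type, TotalSpend) %>%
  mutate(Balance = `Receipts (exports)` - `Payments (imports)`)
```

A positive `Balance` means visitors to the U.S. spent more than Americans spent abroad in that year.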
p1 <- yearly_spend %>%
filter(Type=="Payments (imports)") %>%
select(-Type) %>%
group_by(Region, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Region, TotalSpend) %>%
plot_ly(x = ~Year) %>%
add_trace(y=~africa, type="scatter", name='africa', mode = 'lines+markers', line = list(color="gray", width = 2)) %>%
add_trace(y=~asia, type="scatter", name='asia', mode = 'lines+markers', line = list(color="red", width = 2)) %>%
add_trace(y=~europe, type="scatter", name='europe', mode = 'lines+markers', line = list(color="blue", width = 2)) %>%
add_trace(y=~`latin america`, type="scatter", name='latin america', mode = 'lines+markers', line = list(color="green", width = 2)) %>%
add_trace(y=~`middle east`, type="scatter", name='middle east', mode = 'lines+markers', line = list(color="orange", width = 2))
p2 <- yearly_spend %>%
filter(Type=="Receipts (exports)") %>%
select(-Type) %>%
group_by(Region, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Region, TotalSpend) %>%
plot_ly(x = ~Year) %>%
add_trace(y=~africa, type="scatter", name='africa', mode = 'lines+markers', line = list(color="gray", width = 2), showlegend=F) %>%
add_trace(y=~asia, type="scatter", name='asia', mode = 'lines+markers', line = list(color="red", width = 2), showlegend=F) %>%
add_trace(y=~europe, type="scatter", name='europe', mode = 'lines+markers', line = list(color="blue", width = 2), showlegend=F) %>%
add_trace(y=~`latin america`, type="scatter", name='latin america', mode = 'lines+markers', line = list(color="green", width = 2), showlegend=F) %>%
add_trace(y=~`middle east`, type="scatter", name='middle east', mode = 'lines+markers', line = list(color="orange", width = 2), showlegend=F)
subplot(p1, p2, shareY=T) %>%
layout(title = "Payments vs Receipts by Region",
yaxis = list(title = "Spend"),
legend = list(orientation = 'h', y=-0.15)
)
In fact, payments are quite stable from year to year, with no big jump or boom; receipts, however, are a different story. After breaking them down by region, it is clear that both the total amount and the growth rate for the Asia region are much higher than for the others. The global financial crisis also had a big influence on the money spent by visitors from Latin America.
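To compare growth rates across regions with a single number, one option is the compound annual growth rate, (end / start)^(1 / years) - 1. The sketch below applies it to made-up receipts (`receipts_toy`); it is an illustration, not a calculation taken from the report.

```r
library(dplyr)

# Toy yearly receipts for two regions (values are made up)
receipts_toy <- data.frame(
  Region = rep(c("asia", "europe"), each = 3),
  Year = rep(c(2010, 2011, 2012), times = 2),
  TotalSpend = c(100, 120, 144, 100, 105, 110.25),
  stringsAsFactors = FALSE
)

growth <- receipts_toy %>%
  group_by(Region) %>%
  arrange(Year, .by_group = TRUE) %>%
  summarise(
    Years = last(Year) - first(Year),
    # Compound annual growth rate: (end / start)^(1 / years) - 1
    CAGR = (last(TotalSpend) / first(TotalSpend))^(1 / Years) - 1
  )
```

In this toy example the first region grows by 20% per year and the second by 5% per year, which is the kind of gap the regional chart suggests for Asia.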
p1 <- yearly_spend %>%
filter(Type=="Payments (imports)") %>%
select(-Type, -Region) %>%
group_by(Category, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Category, TotalSpend) %>%
ungroup %>%
plot_ly(x = ~Year) %>%
add_trace(y = ~Education, name = 'Education', type = 'scatter', mode = 'lines+markers', line = list(color="blue")) %>%
add_trace(y = ~`Medical/Short-Term Workers`, name = 'Medical/Short-Term Workers', type = 'scatter', mode = 'lines+markers', line = list(color="orange")) %>%
add_trace(y = ~`Other Business/Other Personal Travel`, name = 'Other Business/Other Personal Travel', type = 'scatter', mode = 'lines+markers', line = list(color="green"))
p2 <- yearly_spend %>%
filter(Type=="Receipts (exports)") %>%
select(-Type, -Region) %>%
group_by(Category, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Category, TotalSpend) %>%
ungroup %>%
plot_ly(x = ~Year) %>%
add_trace(y = ~Education, name = 'Education', type = 'scatter', mode = 'lines+markers', line = list(color="blue"), showlegend=F) %>%
add_trace(y = ~`Medical/Short-Term Workers`, name = 'Medical/Short-Term Workers', type = 'scatter', mode = 'lines+markers', line = list(color="orange"), showlegend=F) %>%
add_trace(y = ~`Other Business/Other Personal Travel`, name = 'Other Business/Other Personal Travel', type = 'scatter', mode = 'lines+markers', line = list(color="green"), showlegend=F)
subplot(p1, p2, shareY=T) %>%
layout(title = "Payments vs Receipts by Category",
yaxis = list(title = "Spend"),
legend = list(orientation = 'h', y=-0.15)
)
Finally, we look at the three categories of spending. Other Business/Other Personal Travel has the largest amount in both payments and receipts. It would be nice if this category could be broken down into smaller subgroups, but unfortunately we do not have access to more detailed data. One thing worth mentioning is that, unlike the other two categories, Education was not affected by the global financial crisis.